Skip to content

Supermarket python project

Importing all needful libraries and csv file with data

import pandas as pd
import math
import plotly.express as px
import plotly.graph_objects as go
from IPython.core.display import display, HTML


sales = pd.read_csv('supermarket_sales - Sheet1.csv', header=0)
sales
/var/folders/8q/d84brk7924n4jpj915_z1nm00000gn/T/ipykernel_21161/1561358757.py:5: DeprecationWarning: Importing display from IPython.core.display is deprecated since IPython 7.14, please import from IPython.display
  from IPython.core.display import display, HTML
Invoice ID Branch City Customer type Gender Product line Unit price Quantity Tax 5% Total Date Time Payment cogs gross margin percentage gross income Rating
0 750-67-8428 A Yangon Member Female Health and beauty 74.69 7 26.1415 548.9715 1/5/2019 13:08 Ewallet 522.83 4.761905 26.1415 9.1
1 226-31-3081 C Naypyitaw Normal Female Electronic accessories 15.28 5 3.8200 80.2200 3/8/2019 10:29 Cash 76.40 4.761905 3.8200 9.6
2 631-41-3108 A Yangon Normal Male Home and lifestyle 46.33 7 16.2155 340.5255 3/3/2019 13:23 Credit card 324.31 4.761905 16.2155 7.4
3 123-19-1176 A Yangon Member Male Health and beauty 58.22 8 23.2880 489.0480 1/27/2019 20:33 Ewallet 465.76 4.761905 23.2880 8.4
4 373-73-7910 A Yangon Normal Male Sports and travel 86.31 7 30.2085 634.3785 2/8/2019 10:37 Ewallet 604.17 4.761905 30.2085 5.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 233-67-5758 C Naypyitaw Normal Male Health and beauty 40.35 1 2.0175 42.3675 1/29/2019 13:46 Ewallet 40.35 4.761905 2.0175 6.2
996 303-96-2227 B Mandalay Normal Female Home and lifestyle 97.38 10 48.6900 1022.4900 3/2/2019 17:16 Ewallet 973.80 4.761905 48.6900 4.4
997 727-02-1313 A Yangon Member Male Food and beverages 31.84 1 1.5920 33.4320 2/9/2019 13:22 Cash 31.84 4.761905 1.5920 7.7
998 347-56-2442 A Yangon Normal Male Home and lifestyle 65.82 1 3.2910 69.1110 2/22/2019 15:33 Cash 65.82 4.761905 3.2910 4.1
999 849-09-3807 A Yangon Member Female Fashion accessories 88.34 7 30.9190 649.2990 2/18/2019 13:28 Cash 618.38 4.761905 30.9190 6.6

1000 rows × 17 columns

Descriptive statistics

There is mean and standart deviation of different fields

sales.describe()
Unit price Quantity Tax 5% Total cogs gross margin percentage gross income Rating
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 1000.00000
mean 55.672130 5.510000 15.379369 322.966749 307.58738 4.761905 15.379369 6.97270
std 26.494628 2.923431 11.708825 245.885335 234.17651 0.000000 11.708825 1.71858
min 10.080000 1.000000 0.508500 10.678500 10.17000 4.761905 0.508500 4.00000
25% 32.875000 3.000000 5.924875 124.422375 118.49750 4.761905 5.924875 5.50000
50% 55.230000 5.000000 12.088000 253.848000 241.76000 4.761905 12.088000 7.00000
75% 77.935000 8.000000 22.445250 471.350250 448.90500 4.761905 22.445250 8.50000
max 99.960000 10.000000 49.650000 1042.650000 993.00000 4.761905 49.650000 10.00000

Median of different columns

print(sales.Quantity.median())
print(sales.Rating.median())
print(int(sales.Total.median()))
5.0
7.0
253

Looking at columns in detail

columns_names = list(sales.columns)
columns = pd.DataFrame(columns_names, columns = ['Names of columns'])
columns
Names of columns
0 Invoice ID
1 Branch
2 City
3 Customer type
4 Gender
5 Product line
6 Unit price
7 Quantity
8 Tax 5%
9 Total
10 Date
11 Time
12 Payment
13 cogs
14 gross margin percentage
15 gross income
16 Rating
Looking on different product lines presented in dataset
d = pd.DataFrame(set(list(sales['Product line'])))
d
0
0 Food and beverages
1 Fashion accessories
2 Home and lifestyle
3 Sports and travel
4 Electronic accessories
5 Health and beauty

Data cleanup

Checking for zero and "NaN" elements

sales.count()
Invoice ID                 1000
Branch                     1000
City                       1000
Customer type              1000
Gender                     1000
Product line               1000
Unit price                 1000
Quantity                   1000
Tax 5%                     1000
Total                      1000
Date                       1000
Time                       1000
Payment                    1000
cogs                       1000
gross margin percentage    1000
gross income               1000
Rating                     1000
dtype: int64

Therefore there is no such elements

Format column names

sales['Total'] = sales['Total'].apply(lambda x: round(x, 2))
sales['gross income'] = sales['gross income'].apply(lambda x: round(x, 2))
sales = sales.drop(['Invoice ID', 'Tax 5%', 'Branch', 'gross margin percentage', 'cogs'], axis = 1)
sales.rename(columns={'gross income': 'Gross income'}, inplace=True)
sales
City Customer type Gender Product line Unit price Quantity Total Date Time Payment Gross income Rating
0 Yangon Member Female Health and beauty 74.69 7 548.97 1/5/2019 13:08 Ewallet 26.14 9.1
1 Naypyitaw Normal Female Electronic accessories 15.28 5 80.22 3/8/2019 10:29 Cash 3.82 9.6
2 Yangon Normal Male Home and lifestyle 46.33 7 340.53 3/3/2019 13:23 Credit card 16.22 7.4
3 Yangon Member Male Health and beauty 58.22 8 489.05 1/27/2019 20:33 Ewallet 23.29 8.4
4 Yangon Normal Male Sports and travel 86.31 7 634.38 2/8/2019 10:37 Ewallet 30.21 5.3
... ... ... ... ... ... ... ... ... ... ... ... ...
995 Naypyitaw Normal Male Health and beauty 40.35 1 42.37 1/29/2019 13:46 Ewallet 2.02 6.2
996 Mandalay Normal Female Home and lifestyle 97.38 10 1022.49 3/2/2019 17:16 Ewallet 48.69 4.4
997 Yangon Member Male Food and beverages 31.84 1 33.43 2/9/2019 13:22 Cash 1.59 7.7
998 Yangon Normal Male Home and lifestyle 65.82 1 69.11 2/22/2019 15:33 Cash 3.29 4.1
999 Yangon Member Female Fashion accessories 88.34 7 649.30 2/18/2019 13:28 Cash 30.92 6.6

1000 rows × 12 columns

Plots

Amount of products by each product line

dfk = sales.groupby("Product line")['Quantity'].sum()
dfk = pd.DataFrame(dfk)
dfk.reset_index(inplace=True)
dfk = dfk.sort_values('Quantity', ascending=True, ignore_index=True)
fig = px.bar(dfk, x='Product line', y='Quantity')
fig.write_html('g1.html')
display(HTML('g1.html'))

Product line comparison to unit price

fig=px.box(sales,x='Product line',color='Product line',y='Unit price')
fig.show()

Amount of total sum of purchases by product lined

df = pd.DataFrame({'Line': sales['Product line'], 'City': sales['City'], 'Total': sales['Total']})
df = pd.DataFrame(df.groupby('Line')['Total'].sum())
fig = px.bar(df.reset_index(), x='Line', y='Total')
fig.show()

Amount of total sum of purchases by city

df = pd.DataFrame({'line': sales['Product line'], 'City': sales['City'], 'Total': sales['Total']})
df = pd.DataFrame(df.groupby('City')['Total'].sum())
fig = px.bar(df.reset_index(), x='City', y='Total')
fig.show()

Rating by time of purchases

sale = pd.DataFrame({'Time': sales['Time'], 'Rating': sales['Rating'], 'Total': sales['Total']})
sale = sale.sort_values('Time')
fig = px.scatter(sale, x="Time", y="Rating")
fig.show()

Total sum of purchases by time of purchases

fig = px.scatter(sale, x="Time", y="Total")
fig.show()

Relation of payment methods and total sum of purchases

fig1 = px.pie(sales, names='Payment', values='Total')
fig1.show()

Relation between membership and total sum of purchases and rating

fig = px.pie(sales, values="Total", names="Customer type")
fig.show()
fig = px.pie(sales, values="Rating", names="Customer type")
fig.show()

Detailed overview

What product line gives more total sum of products and in what city

df = pd.DataFrame({'line': sales['Product line'], 'city': sales['City'], 'total': sales['Total']})
df = df.sort_values(by='city', ignore_index=True)
fig = px.sunburst(df, path=['city', 'line'], values='total')
fig.update_layout(margin = dict(t=1, l=1, r=1, b=1))
fig.show()

Membershiping by sex

df = pd.DataFrame({'sex': sales['Gender'], 'Customer type': sales['Customer type']})
male_member = 0
male_notmem = 0
female_member = 0
female_notmem = 0
df_sorted = df.sort_values(by='sex', ascending=False)
#count male member/not member
for i in range(501):    
    if df_sorted.iloc[i, 1] == 'Member':
        male_member += 1
    else:
        male_notmem += 1
for i in range(501, 1000):    
    if df_sorted.iloc[i, 1] == 'Member':
        female_member += 1
    else:
        female_notmem += 1
df1 = pd.DataFrame({'male': [male_member, male_notmem] , 'female': [female_member, female_notmem], 'membership': ['member', 'not_member']})

df_transformed = df1.melt(id_vars='membership', var_name='sex', value_name='count')
fig = px.bar(df_transformed, x='sex', y='count', color='membership', barmode='group', height=350)
fig.show()

Total sum and it dependence of date, time and city

ndf = pd.DataFrame({'date': sales['Date'], 'time': sales['Time'], 'city': sales['City'], 'total': sales['Total']})
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)
fig = px.scatter_3d(ndf, x='date', y='time', z='total',
              color='city', size_max=18,
              symbol='city', opacity=0.7)

# tight layout
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()

We can note that there are interesting trendline each city

Lets look more datailed on them

Mandalay


ndf = ndf[ndf['city'] == 'Mandalay']
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)
fig = px.scatter_3d(
    ndf,
    x='date',
    y='time',
    z='total',
    color='city',
    size_max=18,
    symbol='city',
    opacity=0.7
)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
fig = px.scatter(ndf, x="time", y="date")
fig.show()
ndf = ndf.sort_values(['time', 'date'], ignore_index=True, ascending=True)
fig = px.scatter(ndf, x="time", y="date")
fig.show()

Yangon

ndf = pd.DataFrame({'date': sales['Date'], 'time': sales['Time'], 'city': sales['City'], 'total': sales['Total']})
ndf = ndf[ndf['city'] == 'Yangon']
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)
fig = px.scatter_3d(
    ndf,
    x='date',
    y='time',
    z='total',
    color='city',
    size_max=18,
    symbol='city',
    opacity=0.7
)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)

fig = px.scatter(ndf, x="time", y="date")
fig.show()
ndf = ndf.sort_values(['time', 'date'], ignore_index=True, ascending=True)
fig = px.scatter(ndf, x="time", y="date")
fig.show()

Naypyitaw

ndf = pd.DataFrame({'date': sales['Date'], 'time': sales['Time'], 'city': sales['City'], 'total': sales['Total']})
ndf = ndf[ndf['city'] == 'Naypyitaw']
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)
fig = px.scatter_3d(
    ndf,
    x='date',
    y='time',
    z='total',
    color='city',
    size_max=18,
    symbol='city',
    opacity=0.7
)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)

fig = px.scatter(ndf, x="time", y="date")
fig.show()
ndf = ndf.sort_values(['time', 'date'], ignore_index=True, ascending=True)
fig = px.scatter(ndf, x="time", y="date")
fig.show()

Overall, we can notice similar graph trends for each city This may be due to different factors: working hours, growing popularity of the supermarket and etc.

Average total sum in half hours

data = []

for i in range(10, 21):
    first_halfhour = generate_time_list(i, 0, i, 29)
    sec_halfhour = generate_time_list(i, 30, i, 59)

    total_f = total_s = cnt = cnto = 0

    for j in range(1000):
        if dfr['time'].iloc[j] in first_halfhour:
            cnt += 1
            total_f += int(dfr['total'].iloc[j])

        if dfr['time'].iloc[j] in sec_halfhour:
            cnto += 1
            total_s += int(dfr['total'].iloc[j])

    totalfi = round(total_f / cnt, 2) if cnt else 0
    totalse = round(total_s / cnto, 2) if cnto else 0
    data.append([f"{str(i).zfill(2)}:00 - {str(i).zfill(2)}:29", totalfi])
    data.append([f"{str(i).zfill(2)}:30 - {str(i).zfill(2)}:59", totalse])

# Create a DataFrame from the data
df_result = pd.DataFrame(data, columns=['Time Range', 'Average Total'])
df_result
Time Range Average Total
0 10:00 - 10:29 262.82
1 10:30 - 10:59 357.47
2 11:00 - 11:29 360.95
3 11:30 - 11:59 316.12
4 12:00 - 12:29 273.05
5 12:30 - 12:59 307.42
6 13:00 - 13:29 353.33
7 13:30 - 13:59 322.05
8 14:00 - 14:29 310.98
9 14:30 - 14:59 426.67
10 15:00 - 15:29 339.83
11 15:30 - 15:59 266.19
12 16:00 - 16:29 393.54
13 16:30 - 16:59 271.79
14 17:00 - 17:29 373.14
15 17:30 - 17:59 288.92
16 18:00 - 18:29 297.94
17 18:30 - 18:59 256.90
18 19:00 - 19:29 366.04
19 19:30 - 19:59 338.32
20 20:00 - 20:29 278.70
21 20:30 - 20:59 332.16

Look at origin graph

fig = px.scatter(sale, x="Time", y="Total")
fig.show()
fig = px.line(df_result, x="Time Range", y="Average Total")
fig.show()

Change in the Total price of Country according to date

px.bar(sales, x='City',y='Total',color='City',animation_frame='Date',
      animation_group="City", range_y=[0,1000])

Data transformation

Creating two new columns of different interesting ratios

sales['Price-rating ratio'] = sales['Total']/sales['Rating']
sales['Price-rating ratio'] = sales['Price-rating ratio'].astype('int64')
sales['Qnty/price'] = sales['Total']/sales['Quantity']
sales['Qnty/price'] = sales['Qnty/price'].apply(lambda x: round(x, 2))
sales

City Customer type Gender Product line Unit price Quantity Total Date Time Payment Gross income Rating Price-rating ratio Qnty/price
0 Yangon Member Female Health and beauty 74.69 7 548.97 1/5/2019 13:08 Ewallet 26.14 9.1 60 78.42
1 Naypyitaw Normal Female Electronic accessories 15.28 5 80.22 3/8/2019 10:29 Cash 3.82 9.6 8 16.04
2 Yangon Normal Male Home and lifestyle 46.33 7 340.53 3/3/2019 13:23 Credit card 16.22 7.4 46 48.65
3 Yangon Member Male Health and beauty 58.22 8 489.05 1/27/2019 20:33 Ewallet 23.29 8.4 58 61.13
4 Yangon Normal Male Sports and travel 86.31 7 634.38 2/8/2019 10:37 Ewallet 30.21 5.3 119 90.63
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 Naypyitaw Normal Male Health and beauty 40.35 1 42.37 1/29/2019 13:46 Ewallet 2.02 6.2 6 42.37
996 Mandalay Normal Female Home and lifestyle 97.38 10 1022.49 3/2/2019 17:16 Ewallet 48.69 4.4 232 102.25
997 Yangon Member Male Food and beverages 31.84 1 33.43 2/9/2019 13:22 Cash 1.59 7.7 4 33.43
998 Yangon Normal Male Home and lifestyle 65.82 1 69.11 2/22/2019 15:33 Cash 3.29 4.1 16 69.11
999 Yangon Member Female Fashion accessories 88.34 7 649.30 2/18/2019 13:28 Cash 30.92 6.6 98 92.76

1000 rows × 14 columns

Two example graph using new columns

fig = px.pie(sales, values='Price-rating ratio', names='City')
fig.show()
fig = px.bar(sales, x='Qnty/price', y='Product line')
fig.show()

Hypothesis

City and customer type influence total

grouped_data = sales.groupby(['City', 'Customer type', 'Product line'])['Total'].sum().reset_index()

# Create a faceted bar chart
fig1 = px.bar(grouped_data, x='Product line', y='Total', color='Customer type',
              facet_col='City')
fig1.update_layout(height=400, width=1000)
fig1.show()

In each city, normal customers consistently contribute more to total sales across all product lines, with significant variation in sales by product type.

Higher Unit Prices Lead to Lower Quantities Sold

fig1 = px.scatter(sales, x='Quantity', y='Unit price', trendline='lowess')
fig1.show()

The flat trendline indicates no clear relationship between unit price and quantity sold, suggesting that price changes do not significantly impact sales quantity.

Naypyitaw has more total sales per each customer type

city_customer_sales = sales.groupby(['City', 'Customer type'])['Total'].sum().reset_index()

# Create a bar chart
fig1 = px.bar(city_customer_sales, x='City', y='Total', color='Customer type')
fig1.show()

Naypyitaw has the highest total sales for both member and normal customer types compared to Mandalay and Yangon

Rating is not infuenced by gross income, total and payment method

fig = px.scatter(sale, x="Total", y="Rating", trendline='lowess')
fig.show()

The trendline is nearly flat, suggesting that the total sales amount does not significantly affect customer ratings.

fig3 = px.scatter(sales, x='Gross income', y='Rating', trendline='lowess')
fig3.show()


Similar to the previous plot, the trendline remains flat, indicating that gross income has little to no impact on customer ratings.

fig4 = px.box(sales, x='Payment', y='Rating')
fig4.show()

The ratings are similar across eWallet, cash, and credit card payment methods, with overlapping interquartile ranges.

The overall hypothesis that rating is not influenced by gross income, total sales, or payment method is supported by the data. Each graph consistently shows no significant correlation between these variables and customer ratings.